Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models
In recent years, much progress has been made in learning robotic manipulation
policies that follow natural language instructions. Such methods typically
learn from corpora of robot-language data that was either collected with
specific tasks in mind or expensively re-labelled by humans with rich language
descriptions in hindsight. Recently, large-scale pretrained vision-language
models (VLMs) like CLIP or ViLD have been applied to robotics for learning
representations and scene descriptors. Can these pretrained models serve as
automatic labelers for robot data, effectively importing Internet-scale
knowledge into existing datasets to make them useful even for tasks that are
not reflected in their ground truth annotations? To accomplish this, we
introduce Data-driven Instruction Augmentation for Language-conditioned control
(DIAL): we utilize semi-supervised language labels leveraging the semantic
understanding of CLIP to propagate knowledge onto large datasets of unlabelled
demonstration data and then train language-conditioned policies on the
augmented datasets. This method enables cheaper acquisition of useful language
descriptions compared to expensive human labels, allowing for more efficient
label coverage of large-scale datasets. We apply DIAL to a challenging
real-world robotic manipulation domain where 96.5% of the 80,000 demonstrations
do not contain crowd-sourced language annotations. DIAL enables imitation
learning policies to acquire new capabilities and generalize to 60 novel
instructions unseen in the original dataset.
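To make the relabelling idea concrete, below is a minimal Python sketch of scoring candidate instructions against a demonstration frame with off-the-shelf CLIP. This is illustrative only, not DIAL's actual pipeline: the candidate instruction pool, the use of the final frame, and the confidence threshold are all assumptions introduced here.

```python
# Illustrative sketch: using CLIP to propose language labels for an
# unlabelled demonstration. Candidate texts, the final-frame heuristic,
# and the threshold are hypothetical, not the paper's method.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical pool of instructions to propagate onto the dataset.
candidates = [
    "pick up the green cup",
    "push the red block to the left",
    "open the top drawer",
]

def label_demo(final_frame: Image.Image, threshold: float = 0.25) -> list[str]:
    """Return candidate instructions whose CLIP similarity to the demo's
    final frame exceeds a (hypothetical) confidence threshold."""
    image = preprocess(final_frame).unsqueeze(0).to(device)
    text = clip.tokenize(candidates).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).squeeze(0)
    return [c for c, p in zip(candidates, probs.tolist()) if p > threshold]

# Accepted labels could then condition a language-conditioned policy:
# labels = label_demo(Image.open("demo_final_frame.png"))
```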
Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations
Perceptual understanding of the scene and the relationship between its
different components is important for successful completion of robotic tasks.
Representation learning has been shown to be a powerful technique for this, but
most of the current methodologies learn task-specific representations that do
not necessarily transfer well to other tasks. Furthermore, representations
learned by supervised methods require large labeled datasets for each task that
are expensive to collect in the real world. Using self-supervised learning to
obtain representations from unlabeled data can mitigate this problem. However,
current self-supervised representation learning methods are mostly
object-agnostic, and we demonstrate that the resulting representations are
insufficient for general purpose robotics tasks as they fail to capture the
complexity of scenes with many components. In this paper, we explore the
effectiveness of using object-aware representation learning techniques for
robotic tasks. Our self-supervised representations are learned by observing the
agent freely interacting with different parts of the environment and are queried
in two different settings: (i) policy learning and (ii) object location
prediction. We show that our model learns control policies in a
sample-efficient manner and outperforms state-of-the-art object-agnostic
techniques as well as methods trained on raw RGB images. Our results show a
20% increase in performance in low-data regimes (1000 trajectories) in
policy training using implicit behavioral cloning (IBC). Furthermore, our
method outperforms the baselines for the task of object localization in
multi-object scenes.
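The sketch below illustrates the two query settings the abstract names, using a frozen stand-in encoder that maps a frame to per-object slot embeddings. The architecture, slot count, and dimensions are placeholders, and an explicit regression head stands in for implicit behavioral cloning for brevity; none of this is the paper's actual model.

```python
# Minimal sketch: querying a frozen object-aware encoder for
# (i) policy learning and (ii) object location prediction.
# Encoder architecture and all sizes are hypothetical.
import torch
import torch.nn as nn

NUM_SLOTS, SLOT_DIM, ACTION_DIM = 8, 64, 7  # hypothetical sizes

class ObjectAwareEncoder(nn.Module):
    """Stand-in for a self-supervised encoder that maps an RGB frame to
    per-object slot embeddings (e.g., learned from free interaction)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, NUM_SLOTS * SLOT_DIM),
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # (batch, 3, H, W) -> (batch, NUM_SLOTS, SLOT_DIM)
        return self.backbone(rgb).view(-1, NUM_SLOTS, SLOT_DIM)

encoder = ObjectAwareEncoder().eval()  # frozen after self-supervised pretraining
for p in encoder.parameters():
    p.requires_grad_(False)

# (i) Policy head: slot embeddings -> action (explicit BC for brevity).
policy = nn.Sequential(nn.Flatten(), nn.Linear(NUM_SLOTS * SLOT_DIM, 256),
                       nn.ReLU(), nn.Linear(256, ACTION_DIM))

# (ii) Location head: each slot embedding -> 2D image coordinates.
locator = nn.Linear(SLOT_DIM, 2)

frame = torch.rand(1, 3, 128, 128)   # dummy observation
slots = encoder(frame)               # (1, NUM_SLOTS, SLOT_DIM)
action = policy(slots)               # (1, ACTION_DIM)
locations = locator(slots)           # (1, NUM_SLOTS, 2)
```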